In this paper we discuss the threat of malware targeted at extracting information about the relationships in a 
real-world social network as well as characteristic information about the individuals in the network, which we 
dub Stealing Reality. We present Stealing Reality (SR), explain why it differs from traditional types of network 
attacks, and discuss why its impact is significantly more dangerous than that of other attacks. We also present 
our initial analysis and results regarding the form that an SR attack might take, with the goal of promoting the 
discussion of defending against such an attack, or even just detecting the fact that one has already occurred. 
I. INTRODUCTION 

History has shown that whenever something has a tangible 
value associated with it, there will always be those who try 
to malevolently 'game' the system for profit. These days, the 
field of social networks is experiencing exponential growth 
in popularity while in parallel, computational social science 
[1] and network science [2-4] are providing methods and results
that are applicable in the real world and have a demonstrated
monetary value. We conjecture that the world will increasingly see
malware integrating tools and mechanisms from network science
into its arsenal, as well as attacks that directly target
human-network information as a goal rather than a means.
Paraphrasing Marshall McLuhan's "the medium is the message," we
have reached the stage where, now, "the network is the
message."

Social networking concepts could be discussed both in the 
context of malware's means of spreading, as well as in the 
context of its target goal. Many existing viruses and worms 
use primitive forms of 'social engineering' [5] as a means of 
spreading, in order to gain the trust of their next victims and 
cause them to click on a link or install an application. For ex- 
ample, 'Happy99' was one of the first viruses to attach itself to 
outgoing emails, thus increasing the chances of having the re- 
cipient open an attachment to a seemingly legitimate message 
sent by a known acquaintance [6]. Sometimes the malware's 
originators use similar techniques to seed the attack. A more 
recent example is 'Operation Aurora', a sophisticated attack 
originating in China against dozens of US companies during 
the first half of 2009, where the attack was initiated via links 
spread through a popular Korean Instant Messaging applica- 
tion [7]. Nevertheless, the current discussion focuses more on 
the second context — in which the human network structure 
itself is the goal of the attack. 

When discussing the goal of learning a network's structure, 
it is important to distinguish between the "technical" topol- 
ogy of a digital network and the actual topology of the human 
network that communicates on top of it — which is what we 
are actually interested in. Technically, every phone or com- 
puter can reach nearly any other on the planet, but in practice 
it will only contact a small subset, based on the context of its 
user. Many existing network attacks gather information on the 
digital network topology, usually in order to leverage the at- 
tack itself. Some attacks, for example, make use of an email 
program's address book or a mobile phone's contact list to
spread further. In the context of Stealing Reality, this method
is not as useful, since a majority of these peers would not be
contacted on a routine basis. There is a great deal of
information in the patterns of communication between the user
and their peers. These patterns are affected by many factors of
relationship and context, and could be used in reverse, to
infer the relationship and context. In addition, the
communication patterns, combined with other behavioral data that
can be harvested from mobile devices, could reveal a great
deal of information about the user: their age, their
occupation and role, their personality, and a great deal more.
This type of information could be summarized as a "rich iden- 
tity" profile of a person [8], which is much more informative 
than direct demographic information which is currently used 
to profile users, and could be very valuable to advertisers and 
spammers, for example. 

Expanding from an individual's egocentric network, the
social network as a whole has intricate relationships and
topologies among cliques and sub-groups, which may overlap
and may reside in multiple hierarchies. This is
complicated even more by issues like trust or influence.
The fact that three people know each other does not necessar- 
ily mean that information received by one will propagate in 
the same format to the two peers, if at all. Computational so- 
cial science has shown that many of these aspects of a social 
network could be learned and extracted from communication 
patterns [8]. 

In this paper we discuss the ability of a non-aggressive
(and hence harder to detect) malware agent to steal vital pieces
of information concerning networks and their users. We
analyze this threat and build a mathematical model capable
of predicting the optimal attack strategy against various
networks. Using data from real-world mobile networks we show
that in many cases a "stealth attack" (one that is hard
to detect but steals private information at a slow
pace) can result in the maximal amount of overall knowledge
captured by the operator of the attack. This attack strategy
also makes sense when compared to natural human social
interaction and communication patterns, as we discuss in
our concluding section. The rest of the paper is organized as
follows: Sections II and III expand on the motivation behind
Stealing Reality attacks and their dangers. Section IV describes
the threat model and its analysis, while Section V presents our
preliminary empirical results. Concluding remarks are given
in Section VI.

FIG. 1: The evolution of $\Lambda_S$ as a function of the overall percentage of edges learned, for networks with the same number of edges but different
values of Kolmogorov complexity.



II. MOTIVATION FOR STEALING REALITY 

Many commercial entities have realized the value of
information derived from communication and other behavioral
data for a wide range of applications, such as marketing
campaigns, customer retention, security screening, and recommender
systems. There is no reason to think that developers of
malicious applications will not incorporate the same methods
and algorithms into future malware, or that they have not
already started doing so.

There already exist secondary markets for resale of this type
of information, such as infochimps.com, or black market
sites and chat-rooms for resale of stolen identity information
and other illegal data sets [9]. It is reasonable to assume that
a social hub's email address would be worth more to an
advertiser than that of an edge node. It is also reasonable to assume that
a person meeting the profile of a student might be priced
differently than a corporate executive or a homemaker.
There are already companies operating in the legal grey area, 
which engage in the collection of email and demographic in- 
formation with the intention of selling it [10]. Why work hard 
when one can set loose automatic agents that would collect 
the same if not better quality information? Wang et al. predict 
that once the market share of any specific mobile operating 
system reaches a computable phase transition point, viruses 
could pose a serious threat to mobile communications [11]. 

One might also imagine companies performing this type
of attack on a competitor's customers (to figure out which
customers to try to recruit), or even operations performed by
one country on another. Finally, the results of an SR attack
might later be used for selecting the best targets for future
attacks or for configuring the 'social engineering' components of
other attacks.



III. WHY STEALING REALITY ATTACKS ARE SO 
DANGEROUS 

One of the biggest risks of real-world social network
information being stolen is that this type of information is very static,
especially when compared to traditional targets of malicious
attacks. Data network topologies and identifiers could be
replaced with the press of a button. The same goes for
passwords, usernames, or credit cards. An infected computer
could be wiped and re-installed. An online email, instant mes- 
senger, or social networking account could be easily replaced 
with a similar one, and the users' contacts can be quickly 
warned of the original account's breach. 

However, it is much harder to change one's network of
real-world, person-to-person relationships, friendships, or family
ties. The victim of a "behavioral pattern" theft cannot
easily change her behavior and life patterns. Moreover, this type of
information, once out, would be very hard to contain. In
addition, once the information has been extracted in digital form,
it would be quite hard, if not impossible, to make sure that all
copies have been deleted.

There are many stories in recent years of "reality"
information being stolen and irreversibly put in the open. In
2008, real-life identity information of millions of Korean
citizens was stolen in a series of malicious attacks and posted
for sale [7]. In 2007, the Israeli Ministry of Interior's database
with information on all of the country's citizens was leaked
and posted on the Web [12]. At the time of writing, a court still has
to rule whether the database of a bankrupt gay dating site for
teenagers will be sold to raise money for repaying its
creditors. The site includes personal information of over a million
teenage boys [13]. In all of these cases, once the
information is out, there is no way back, and the damage is felt for
a long time thereafter. In a recent Wall Street Journal
interview, Google CEO Eric Schmidt referred to the possibility
that people in the future might choose to legally change their
name in order to detach themselves from embarrassing
"reality" information publicly exposed on social networking sites
[14]. Speculative as this might be, it demonstrates the
sensitivity and challenges in recovering from leakage of real-life
information, whether by youthful carelessness or by malicious
extraction through an attack.

FIG. 2: An illustration of the easily learnable network notion. The graph depicts the critical learning threshold of $\Lambda_E$ for networks of 1,000,000
nodes, as a function of increasing values of the Kolmogorov complexity $K_E$. Notice how networks for which $K_E < \max\{0, |E| - |E|/\ln(|E|)\}$ are
easily learnable, while more complex networks require significantly larger amounts of information in order to be able to accelerate the network
learning process.



For this reason, Stealing Reality attacks are much more
dangerous than traditional malware attacks. The difference
between SR attacks and more traditional forms of attack
should be treated with the same gravity as the difference
between nonconventional and conventional weapons. The remainder of
this document presents our initial analysis and results
regarding the form that an SR attack might take, in contrast to the
characteristics of conventional malware attacks.



IV. THREAT MODEL

In this section we describe and analyze the threat model.
First, we define the attacker's goals in the terms of our model,
and develop a quantitative measure for assessing the progress
in achieving these goals. Then, we present an analytical model
to predict the success rate of various attacks. Finally, we
provide an assessment of the best strategies for devising such
an attack. We demonstrate, based both on analytical models
and on real mobile network data, that in many
cases the best attack strategy is, counterintuitively, a
"low-aggressiveness attack". Besides yielding the best outcome for
the attacker, such an attack may also deceive existing
monitoring tools, due to its low traffic volumes and the fact that
it imitates natural end-user communication patterns (or even
"piggybacks" on actual messages).



A. Network Model 

We shall model the network as an undirected graph
$G(V, E)$. The difficulty of learning the relevant information
of the network's nodes and edges may be different for
different nodes and for different edges. In general, we denote by
$p_V(u, t)$ the probability that vertex $u$ has been successfully
"learned" or "acquired" by an attacking agent $t$ time units
after the agent was installed on $u$. Similarly, we denote by
$p_E(e, t)$ the probability that an edge $e(u, v)$ has been
successfully learned by an agent $t$ time units after the agent
was installed on it. We shall denote the presence
of an attacking agent on a vertex $u$ at time $t$ by the following
Boolean indicator:

$$I_u(t) = 1 \iff u \text{ is infected at time } t$$

Similarly, we shall denote the presence of an attacking
agent on an edge $e(u, v)$ at time $t$ as:

$$I_e(t) = 1 \iff \text{either } u \text{ or } v \text{ (or both) is infected at time } t$$

For each vertex $u$ and edge $e$, let the times $T_u$ and $T_e$ denote
their initial time of infection.

FIG. 3: An observational study of the overall amount of data that can be captured as a function of $\rho$, the attack's aggressiveness. Notice the
local maximum around $\rho = 0.5$ that is outperformed by the global maximum at $\rho = 0.04$.



B. Attacker's Goal: Stealing Reality 

As information about the network itself has become a
worthy target for an attack, the attacker's motivation is stealing
as many properties related to the network's social topology as
possible. The percentage of vertex-related information
acquired at time $t$ is therefore:

$$\Lambda_V(t) = \frac{1}{|V|} \sum_{u \in V} I_u(t) \cdot p_V(u, t - T_u)$$

Similarly, the percentage of edge-related information
acquired at time $t$ is:

$$\Lambda_E(t) = \frac{1}{|E|} \sum_{e \in E} I_e(t) \cdot p_E(e, t - T_e)$$

As an extension in the spirit of Metcalfe's [15] and Reed's
Law [16], a strong value emerges from learning the "social
principles" behind a network. Understanding the essence behind
the implied social network is more valuable (and also requires
much more information in order to learn) as the information
it encapsulates is greater. For example, let us imagine the
following two mobile social networks:

1. For every two users $u_i$, $u_j$, the users are connected if
and only if they joined the network in the same month.

2. For every two users $u_i$, $u_j$, the users are connected with
probability $p = 1/2$.



It is easy to see that given a relatively small subset of
network 1, the logic behind its social network can be discovered
quite easily. Once this logic is discovered, the rest of the
network can be generated automatically (as edges are added
exactly for pairs of users who joined the network in the same
month). Specifically, for every value of $\epsilon$ we can calculate a
relatively small number of queries that we should ask in
order to be able to restore the complete network with a success
probability of $1 - \epsilon$. However, for network 2 the situation is
much different, as the only strategy for accurately obtaining
the network is actually discovering all the edges it comprises.
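
The contrast between the two example networks can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: it uses the compressed size produced by Python's zlib as a crude, computable stand-in for the description length of a network (the true minimal description length is not computable), and the number of users, the 12 join months, and the indexing of users in join order are all hypothetical choices.

```python
import random
import zlib
from itertools import combinations


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte."""
    out = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # pad a short final chunk with zeros
        out.append(byte)
    return bytes(out)


def compressed_size(edge_bits):
    """Bytes zlib needs to store the upper-triangular adjacency bits."""
    return len(zlib.compress(pack_bits(edge_bits), 9))


n_users = 200  # hypothetical network size
pairs = list(combinations(range(n_users), 2))

# Network 1: users are indexed in join order over 12 hypothetical months;
# two users are connected iff they joined in the same month.
join_month = [i * 12 // n_users for i in range(n_users)]
network_1 = [int(join_month[i] == join_month[j]) for i, j in pairs]

# Network 2: every pair is connected independently with probability 1/2.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
network_2 = [rng.randint(0, 1) for _ in pairs]

size_1 = compressed_size(network_1)
size_2 = compressed_size(network_2)
print(f"rule-based network: {size_1} bytes")
print(f"random network:     {size_2} bytes")
```

Under these assumptions the rule-based network compresses to far fewer bytes than the random one, which mirrors the argument above: a network generated by a simple rule can be restored from little information, while a random network has to be recorded edge by edge.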

Let us denote by $K_E$ the Kolmogorov Complexity [17] of
the network, namely the minimal number of bits required in
order to "code" the network in such a way that it could later
be completely restored. As the number of vertices $|V|$ is
assumed to be known, the essence of the network is coded in
its edges. Dividing the number of edges learned $|E|\Lambda_E(t)$ by
the number of "redundant edges" $|E| - K_E$ yields the amount
of information learned at time $t$. Following a similar logic to
Reed's Law, we shall evaluate the benefit of the learning
process proportionally to the number of combinations that can be
composed from the information learned. Normalizing it by the
number of edges, we obtain the following measure
for the social essence learned:

$$\Lambda_S(t) = 2^{\frac{|E|\Lambda_E(t) - |E|}{|E| - K_E}}$$

The attacker is interested in maximizing the values of
$\Lambda_V(t)$, $\Lambda_E(t)$ and $\Lambda_S(t)$. The evolution of $\Lambda_S$, the
social essence of the network, as a function of the "complexity
hardness" of the network is illustrated in Figure 1.

FIG. 4: An extensive study of a real-life mobile network of 7,706 nodes and 17,404 edges. Each graph presents the performance of a Stealing
Reality attack for a different set of values of $\alpha$, $\beta$, $\sigma$, $M$, $r_i$. The performance is measured as the percentage of information acquired,
as a function of the infection rate $\rho$. The scenarios presented in this figure demonstrate a global optimum of the attack performance
for very low values of $\rho$. In other words, for many different scenarios it is best to use a very non-aggressive attack, which would
maximize the amount of network information obtained. Values of $\alpha$ and $\beta$ which demonstrated this behavior were between 10 and 500.
Values of $r_i$ were between 0.1 and 100, whereas the values of $\sigma$ were between 0.1 and 12. The values of $M$ were between 0.1 and 30. It is
interesting to mention that for high values of $\alpha$ and $\beta$, low values of $M$ did display this phenomenon while high values of $M$ did not.
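
The behavior of the social essence measure can be explored numerically. The following is a minimal sketch, assuming the measure has the form $\Lambda_S = 2^{(|E|\Lambda_E - |E|)/(|E| - K_E)}$; the function name, the edge count and the three complexity values are hypothetical and chosen only for illustration.

```python
def social_essence(lambda_e, num_edges, k_e):
    """Lambda_S = 2 ** ((|E| * Lambda_E - |E|) / (|E| - K_E))."""
    if not 0.0 <= lambda_e <= 1.0:
        raise ValueError("lambda_e must be a fraction between 0 and 1")
    if not 0 <= k_e < num_edges:
        raise ValueError("this sketch assumes 0 <= K_E < |E|")
    exponent = (num_edges * lambda_e - num_edges) / (num_edges - k_e)
    return 2.0 ** exponent


num_edges = 1000  # hypothetical number of edges
for k_e in (100, 500, 900):  # hypothetical Kolmogorov complexities
    # Evaluate the measure when 0%, 10%, ..., 100% of the edges are learned.
    curve = [social_essence(x / 10, num_edges, k_e) for x in range(11)]
    print(k_e, [round(value, 3) for value in curve])
```

In this sketch every curve reaches 1 once all edges are learned, while networks with a higher complexity stay close to zero for longer, so the attacker has to learn a larger share of the edges before the measure starts to grow.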

C. Attack Analysis 

We assume that the learning process of vertices and edges
follows the well-known Gompertz function, namely:

$$\forall u_i \in V, \quad p_V(u_i, t) = e^{-\alpha e^{-r_i t}}$$

$$\forall e_i \in E, \quad p_E(e_i, t) = e^{-\beta e^{-r_i t}}$$

with $\alpha$ and $\beta$ representing the efficiency of the learning
mechanism used by the attacker, as well as the amount of
information that is immediately obtained upon installation, and $r_i$ denoting
the learning rate of each edge or vertex, determined by the
activity level of the edge or vertex (namely, the accumulation of
new information). The variable $r_i$ is also used for normalizing $\alpha$
and $\beta$. Hence, the attack success rates can now be written as
follows:

$$\Lambda_V(t) = \frac{1}{|V|} \sum_{u_i \in V} I_{u_i}(t) \cdot e^{-\alpha e^{-r_i (t - T_{u_i})}}$$

$$\Lambda_E(t) = \frac{1}{|E|} \sum_{e_i \in E} I_{e_i}(t) \cdot e^{-\beta e^{-r_i (t - T_{e_i})}}$$
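
The success rates above can be evaluated directly once infection times are known. The sketch below is an illustration under stated assumptions: learning follows $e^{-\alpha e^{-r(t - T)}}$ for vertices (and the same form with $\beta$ for edges), an edge is treated as infected from the moment either endpoint is infected (so its infection time is the earlier of the two), and the toy graph, infection times and parameter values are hypothetical.

```python
import math


def gompertz(efficiency, rate, elapsed):
    """Probability of having learned an item `elapsed` steps after infection."""
    return math.exp(-efficiency * math.exp(-rate * elapsed))


def vertex_success(t, vertices, infected_at, alpha, rate):
    """Lambda_V(t): average learned fraction over all vertices."""
    total = 0.0
    for u in vertices:
        t_u = infected_at.get(u)
        if t_u is not None and t_u <= t:  # indicator I_u(t) = 1
            total += gompertz(alpha, rate[u], t - t_u)
    return total / len(vertices)


def edge_success(t, edges, infected_at, beta, rate):
    """Lambda_E(t): an edge hosts an agent once either endpoint is infected."""
    total = 0.0
    for (u, v) in edges:
        times = [infected_at[x] for x in (u, v) if x in infected_at]
        if times and min(times) <= t:  # indicator I_e(t) = 1
            total += gompertz(beta, rate[(u, v)], t - min(times))
    return total / len(edges)


# Hypothetical toy network: a path a-b-c-d with two infected vertices.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
infected_at = {"a": 0, "b": 2}             # hypothetical infection times
rate = {x: 0.5 for x in vertices + edges}  # hypothetical learning rates
alpha = beta = 5.0                         # hypothetical efficiencies

for t in (0, 5, 20):
    print(t,
          round(vertex_success(t, vertices, infected_at, alpha, rate), 3),
          round(edge_success(t, edges, infected_at, beta, rate), 3))
```

With only two of four vertices infected, the vertex success rate in this toy case can approach but never exceed one half, which illustrates why the spread of the agents matters as much as the learning rate on each infected device.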

Attacking agents spread through movements on network
edges. An overly aggressive infection is more likely to be detected,
causing the accumulation of information concerning the
network to be blocked altogether. On the other hand, attack
agents that spread too slowly may evade detection for a long
period of time, but the amount of data they gather would
still be very limited. In order to predict the detection
probability of attacking agents at time $t$ we shall use Richard's Curve,
as follows:

$$P_{detect}(t) = \frac{1}{\left(1 + e^{-\rho (t - M)}\right)^{\sigma}}$$

where $\rho$ is the probability that an agent would copy itself
to a neighboring vertex at each time step, $\sigma$ is a normalizing
constant for the detection mechanism, and $M$ denotes the
normalizing constant for the system's initial state. Let $N_t$ denote
the number of infected vertices at time $t$. Assuming that
the infection of vertices by their infected neighbors is a random process,
the number of infected neighbors vertex $u$ would have at time $t$
is:

$$\deg(u) \cdot \frac{N_t}{|V|}$$

The probability that vertex $u$ would be attacked at time $t$
therefore equals at least:

$$P_{attack}(u, t) = 1 - e^{-\rho \cdot \frac{N_t \cdot \deg(u)}{\sum_{v \in V} \deg(v)}}$$

and the expected number of infected vertices is:

$$N_{t + \Delta t} = |V| - \sum_{v \in V} \prod_{i=0}^{t} \left(1 - P_{attack}(v, i)\right)$$

The number of infected nodes therefore grows as (recall that $\sum_{v \in V} \deg(v) = 2|E|$):

$$N_{t + \Delta t} = |V| - \sum_{v \in V} \prod_{i=0}^{t} e^{-\rho \cdot \frac{N_i \cdot \deg(v)}{2|E|}}$$

From $N_t$ we can now derive the distribution of the Boolean
infection indicators:

$$p[I_u(t) = 1] = \frac{N_t}{|V|}$$

$$p[I_e(t) = 1] = 2 \cdot \frac{N_t}{|V|} - \frac{N_t^2}{|V|^2}$$

And the attack probability can now be given as follows:

$$P_{attack}(u, t + \Delta t) = 1 - e^{\rho \cdot \frac{\deg(u)}{2|E|} \left(-|V| + \sum_{v \in V} \prod_{i=0}^{t} \left(1 - P_{attack}(v, i)\right)\right)}$$

This expression can now be used for calculating the
distribution of initial infection times of vertices and edges. Note
that information is gathered faster as the infection rate $\rho$ increases.
However, so does the detection probability. The optimum can
therefore be derived by calculating the expectation of the total
amount of information obtained (in which the only free
parameter is $\rho$).



D. Obtaining the Social Essence of a Network

Recalling the expression that represents the progress of
learning the "social essence" of a network, we can see that
initially each new edge contributes $O(1)$ information, and the
overall amount of information is therefore kept proportional
to $O(\frac{1}{|E|})$. As the learning progresses and the logical principles
behind the network's structure start to unveil, the amount of
information gathered from new edges becomes greater than
their linear value. At this point, the overall amount of
information becomes greater than $O(\frac{1}{|E|})$, and the benefit of
acquiring the social structure of the network starts to accelerate.
Formally, we can see that this phase is reached when:

$$\Lambda_E(t) > O\left(1 - \frac{|E| - K_E}{|E|} \ln(|E|)\right)$$

We define the Critical Learning Threshold as the value of $\Lambda_E$
above which the learning process of the network accelerates,
as described above (with each newly learned edge contributing
an increasingly growing amount of information concerning
the network's structure):

$$\Lambda_E = 1 - \frac{|E| - K_E}{|E|} \ln(|E|)$$

Consequently, in order to provide as strong protection for
the network as possible, we should make sure that for every
value of $t$:

$$\sum_{e_i \in E} I_{e_i}(t) \cdot e^{-\beta e^{-r_i (t - T_{e_i})}} < |E| - (|E| - K_E) \ln(|E|)$$

Alternatively, the attack would prevail when there exists a
time $t$ for which the above no longer holds.

Notice that, as pointed out above, "weaker" networks
(namely, networks of low Kolmogorov complexity) are easy
to learn using a limited amount of information.
Generalizing this notion, the following question can be asked: How
"simple" must a network be, in order for it to be "easily
learnable" (namely, presenting a superlinear learning speed,
starting from the first edges learned)?

It can be seen that in order for a network to be easily
learnable, its critical learning threshold must equal $O(1)$.
Namely, the network's Kolmogorov complexity must satisfy:

$$1 - \frac{|E| - K_E}{|E|} \ln(|E|) < O(1)$$

We therefore obtain the following criterion for easily learnable
networks:

$$K_E < |E| - \frac{|E|}{\ln(|E|)}$$

The notion of an easily learnable network is illustrated in
Figure 2, presenting the critical learning threshold for
networks of 1,000,000 nodes, as a function of the network's
Kolmogorov complexity.
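
The threshold and the learnability criterion can be checked numerically. The following is a minimal sketch, assuming the threshold has the form $1 - \frac{|E| - K_E}{|E|}\ln(|E|)$ and the criterion the form $K_E < |E| - |E|/\ln(|E|)$; the function names, the edge count and the complexity values are hypothetical.

```python
import math


def critical_learning_threshold(num_edges, k_e):
    """Threshold on Lambda_E: 1 - (|E| - K_E) / |E| * ln(|E|)."""
    return 1.0 - (num_edges - k_e) / num_edges * math.log(num_edges)


def is_easily_learnable(num_edges, k_e):
    """Criterion K_E < |E| - |E| / ln(|E|)."""
    return k_e < num_edges - num_edges / math.log(num_edges)


num_edges = 1_000_000  # hypothetical number of edges
for k_e in (500_000, 900_000, 950_000, 999_000):
    threshold = critical_learning_threshold(num_edges, k_e)
    print(k_e, round(threshold, 3), is_easily_learnable(num_edges, k_e))
```

In this sketch the criterion holds exactly when the threshold is negative, that is, when even the first learned edges already lie above it; for more complex networks the threshold becomes a positive fraction of the edges that must be learned first.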



V. EXPERIMENTAL RESULTS 

We have evaluated our model on aggregated call logs
derived from a real mobile phone network, comprising
approximately 200,000 nodes and 800,000 edges. These tests
have shown that in many cases, an "aggressive attack"
achieves inferior results compared to more subtle attacks.
Furthermore, although sometimes the optimal value for the
infection rate revolves around 50%, there are scenarios in which
there is only a local maximum around this value, with the global
maximum around 4%. Figure 3 demonstrates the attack
efficiency (namely, the maximal amount of network information
acquired) as a function of its "aggressiveness" (i.e., the
attack's infection rate). A global optimum both for the vertex
information and for the edge information is achieved
around 4%, with a local optimum around 52%.
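
The tension between spreading speed and detection risk can be illustrated with a small numerical sketch. It is not the simulation used for the results reported here. It assumes a detection probability of the form $1/(1 + e^{-\rho(t - M)})^{\sigma}$ and a per-step escape probability of $e^{-\rho N_i \deg(v)/(2|E|)}$ for each vertex; the degree sequence, the parameter values, the single initial seed and the rule that the infected count never drops below the seed are all hypothetical assumptions.

```python
import math


def detection_probability(t, rho, sigma, m):
    """Richards-curve detection probability at time t."""
    return 1.0 / (1.0 + math.exp(-rho * (t - m))) ** sigma


def infected_counts(degrees, rho, steps, seeds=1.0):
    """Iterate the expected number of infected vertices N_t.

    Each vertex v escapes infection at step i with probability
    exp(-rho * N_i * deg(v) / (2|E|)). Starting from `seeds` infected
    vertices and never dropping below it is an assumption of this sketch.
    """
    two_e = sum(degrees)           # the sum of degrees equals 2|E|
    escape = [1.0] * len(degrees)  # running escape product per vertex
    history = [seeds]
    for _ in range(steps):
        n_now = history[-1]
        escape = [p * math.exp(-rho * n_now * d / two_e)
                  for p, d in zip(escape, degrees)]
        history.append(max(seeds, len(degrees) - sum(escape)))
    return history


degrees = [3, 2, 2, 1, 4, 2, 3, 1]  # hypothetical degree sequence
for rho in (0.04, 0.5):
    n_t = infected_counts(degrees, rho, steps=30)
    p_det = detection_probability(30, rho, sigma=1.0, m=15.0)
    print(rho, round(n_t[-1], 2), round(p_det, 3))
```

Under these assumptions the aggressive setting infects nearly the whole toy network within the simulated window but is almost certain to have been detected by then, while the slow setting infects far fewer vertices at a markedly lower detection probability.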



A more extensive simulation study was conducted for
an arbitrary sub-network of this mobile network, containing
7,706 nodes and 17,404 edges. In this study we have
extensively examined the success of a Stealing Reality attack using
numerous different sets of parameter values (i.e., $\alpha$, $\beta$, $r_i$, $\sigma$ and $M$).
Although the actual percentage of stolen information varied
significantly between the various sets, many of them
displayed the same interesting phenomenon: a global optimum
for the performance of the attack, located around a very low
value of $\rho$. Some of these scenarios are presented in Figure 4.



VI. CONCLUSIONS 

In this paper we discussed the threat of malware targeted at
extracting information about the relationships in a real-world
social network as well as characteristic information about the
individuals in the network, which we name "Stealing
Reality". We presented Stealing Reality (SR), explained why it differs
from traditional types of network attacks, and discussed why its
impact is significantly more dangerous than that of other
attacks. We also presented our initial analysis and results
regarding the form that an SR attack might take. We have evaluated
our model on data derived from a real mobile network. Our
results show that in many cases an "aggressive attack" achieves
inferior results compared to more subtle attacks. This attack
strategy also makes sense when comparing it to natural
human social interaction and communication patterns. The rate
of human communication and of the evolution of relationships is very
slow compared to traditional malware attack message rates. A
Stealing Reality type of attack, which is targeted at learning
the social communication patterns, could "piggyback" on the
user-generated messages, or imitate their natural patterns, thus
not drawing attention to itself while still achieving its target
goals.